atom type i = lower bucket i
atom type j = lower bucket j

atom type i = upper bucket i
atom type j = upper bucket j
For most of the descriptors, only non-hydrogen atoms are
sp2 even though the bonds are drawn differently), we must
take steps to fix the chemical representation of each
connection table. We would also like to assign formal
charges to the molecules as they would exist in their
ionization state at physiological pH. (The formal charges
already in MACCS may be unreliable for this purpose and
are ignored.) To this end, we use the program PATTY (ref 7)
to locate particular substructures and assign intermediate
types to the atoms in those substructures. The library of
substructures is available as supporting information, and the
PATTY language is explained in ref 7. The intermediate
type includes information about the binding property class,
the hybridization, and the formal charge. For instance, both
oxygen atoms in the substructure "*-C(=O)-O&X1" (carboxylate)
are assigned the intermediate type "32", indicating
an anion with one π electron and a formal charge of -1/2.

Subsequent programs use the intermediate type to further
prepare the molecule to be parsed into descriptors. The
physicochemical class is parsed directly from the intermediate
type. For the calculation of partial charges, explicit hydrogens
are added to heteroatoms based on element, hybridization,
and formal charge. For instance, a tertiary amine with
a formal charge of +1 would receive one hydrogen. The
method of Gasteiger and Marsili (ref 8) is then applied to the
molecules to get the final partial charges.

The algorithm for calculating atomic log P is modified
from Klopman and Wang (ref 9). Their original method estimates
total molecular log P as the sum of contributions from 39
molecular fragments, each of which may contain 1-4
non-hydrogen atoms. Our approach is to first assign each atom
the value of the corresponding single-atom fragment. For
instance, a carbon would start with an atomic log P of 0.320.
If an atom is in one of the larger fragments, the atom gets an
additional contribution equal to the fragment value divided
by the size of the fragment. For instance, if the carbon were
in the group N=C(-X)-X, one-fourth of the group value of
-0.150 would be added to the carbon. In this implementation,
the atomic log P's sum to the molecular log P as
calculated by Klopman and Wang.
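
The atom-by-atom bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the function name, the data layout, and the zero values used for the N and X single-atom fragments are assumptions made for the example; only the carbon value (0.320) and the group value (-0.150) come from the text.

```python
def atomic_logp(elements, fragments, single_atom_value, fragment_value):
    """Return {atom index: atomic log P}.

    elements:  {atom index: single-atom fragment name}
    fragments: list of (fragment name, tuple of member atom indices)
    """
    # Step 1: every atom starts with its single-atom fragment value.
    logp = {i: single_atom_value[name] for i, name in elements.items()}
    # Step 2: each larger fragment is shared equally among its atoms.
    for name, members in fragments:
        share = fragment_value[name] / len(members)
        for i in members:
            logp[i] += share
    return logp


# 0.320 and -0.150 are quoted in the text; the 0.0 entries are
# placeholders (assumptions), not published fragment values.
single_atom_value = {"C": 0.320, "N": 0.0, "X": 0.0}
fragment_value = {"N=C(-X)-X": -0.150}
elements = {0: "N", 1: "C", 2: "X", 3: "X"}
fragments = [("N=C(-X)-X", (0, 1, 2, 3))]

result = atomic_logp(elements, fragments, single_atom_value, fragment_value)
carbon_logp = result[1]          # 0.320 + (-0.150 / 4)
molecular_logp = sum(result.values())
```

Because each fragment value is split evenly over its member atoms, the atomic values always add back up to the fragment-based molecular total.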

After the preparation steps above, the descriptors are
calculated and stored in a randomly accessible database called
a topobase. For each entry we store a molecular identifier,
an estimated molecular log P, and the number of each type
of the eight descriptors. For each descriptor type we list
the unique descriptors present in the molecule and their
counts. To save space, the unique descriptors are identified
by two-byte integers. For the property types, the mapping
to integers is straightforward. For instance, the bp descriptor
"1-(12)-7" can be written as "010712". However, for ap
and tt there are too many possible atom types, so each unique
descriptor must be arbitrarily assigned a unique integer. For
instance, the ap "NX2-(3)-CX3." might be assigned "012345"
if it were the 12 345th unique atom pair encountered during
the construction of a set of topobases. We store and update
an auxiliary file of the mapping of ap's and tt's to integers
so that all topobases can use the same mapping.
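
The two mapping schemes can be illustrated with a short Python sketch. The function and class names are hypothetical, and an in-memory dictionary stands in for the auxiliary mapping file; the digit layout (type i, type j, distance) is inferred from the "1-(12)-7" to "010712" example in the text.

```python
def encode_property_descriptor(type_i, distance, type_j):
    """Encode a property pair such as 1-(12)-7 as the string '010712'."""
    return f"{type_i:02d}{type_j:02d}{distance:02d}"


class DescriptorRegistry:
    """Assign arbitrary sequential integers to ap/tt descriptors.

    A stand-in for the auxiliary mapping file: the n-th new descriptor
    seen receives the integer n, and repeats reuse the same integer.
    """

    def __init__(self):
        self._ids = {}

    def id_for(self, descriptor):
        if descriptor not in self._ids:
            self._ids[descriptor] = len(self._ids) + 1
        return self._ids[descriptor]


bp_code = encode_property_descriptor(1, 12, 7)

registry = DescriptorRegistry()
first = registry.id_for("NX2-(3)-CX3.")
second = registry.id_for("CX3.-(2)-OX1")   # hypothetical second atom pair
repeat = registry.id_for("NX2-(3)-CX3.")
```

Sharing one registry across all topobases is what keeps the integer codes comparable between databases.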

In our implementation topobases are stored as mapped files 
on a VAX 8000 using the VMS operating system. Mapped 
files allow for random access of any given molecule by its 
identifier and allow direct mapping of the data into program 
core for rapid reading. 

Definition of Similarity. Throughout we will use the
index of similarity used by Carhart et al. (ref 3). The
similarity of molecules A and B is

$$\mathrm{Sim}_{AB} = \frac{\sum_k \min(f_{Ak}, f_{Bk})}{0.5\left(\sum_k f_{Ak} + \sum_k f_{Bk}\right)}$$

where f_Ak is the count of descriptor k in molecule A. The
index k goes over the union of unique descriptors in A and
B. Sim_AB ranges from 0.0 (nothing in common) to 1.0
(identity). It should be noted that, because of the denominator,
molecular size counts as part of the similarity index.
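
A minimal Python sketch of this index follows, assuming the Carhart form in which the shared descriptor count is divided by the mean of the two total descriptor counts. The function name and the toy descriptor labels are illustrative only.

```python
from collections import Counter


def similarity(counts_a, counts_b):
    """Similarity of two molecules given their descriptor counts."""
    a, b = Counter(counts_a), Counter(counts_b)
    # k runs over the union of unique descriptors; a Counter returns 0
    # for a descriptor that is absent from one molecule.
    shared = sum(min(a[k], b[k]) for k in a.keys() | b.keys())
    mean_total = 0.5 * (sum(a.values()) + sum(b.values()))
    return shared / mean_total if mean_total else 0.0


mol_a = {"x": 2, "y": 1}
mol_b = {"x": 1, "z": 1}
sim_ab = similarity(mol_a, mol_b)    # 1 shared / (0.5 * (3 + 2))
sim_aa = similarity(mol_a, mol_a)    # identity
sim_disjoint = similarity({"x": 1}, {"y": 1})
```

The toy pair shows the size effect noted in the text: the shared count is judged against the total number of descriptors in both molecules, so a small molecule contained in a large one still scores well below 1.0.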

Calculation of "Fuzziness". Given that our new descriptors
are meant to be less specific, i.e., more "fuzzy", than
the originals, it would be useful to be able to calculate relative
fuzziness. We propose two approaches. First, if descriptors
are specific, any two of them from the same molecule are
likely to be different. If they are fuzzy, any two descriptors
are likely to be the same. We can monitor this for a single
molecule by the ratio

$$R = \frac{\text{total no. of descriptors}}{\text{no. of unique descriptors}}$$

R can be regarded as a descriptor-based measure of intramolecular
symmetry. For the ap descriptor in the molecule in
Figure 1, for instance, R = 28/23 = 1.22. The fuzziness
value for a given descriptor can be taken as the median of R
over a large sample of molecules.
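
As a small worked illustration of the ratio R, the Python sketch below reproduces the 28/23 example. The descriptor labels and the three-molecule sample used for the median are invented for the example.

```python
from statistics import median


def descriptor_ratio(counts):
    """R = total number of descriptors / number of unique descriptors."""
    return sum(counts.values()) / len(counts)


# 23 unique descriptors, 5 of which occur twice: 18 + 2 * 5 = 28 in total.
example_counts = {f"d{i}": (2 if i < 5 else 1) for i in range(23)}
r_example = descriptor_ratio(example_counts)

# Fuzziness of a descriptor type: median R over a sample of molecules
# (three hypothetical R values here).
fuzziness = median([1.0, 1.5, 4.0])
```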

A second approach is to monitor how similar any two 
molecules are likely to be. The more fuzzy the descriptor, 
the more two molecules will have in common and the more 
similar they will appear. The fuzziness of a given descriptor 
is taken as the median similarity for a large sample of pairs 
of molecules. 

For both measures it is important to use the same sample
of molecules when comparing descriptors so the distribution
of intramolecular symmetry (which affects the first measure)
and size (which affects the second) will be constant.

How Searches Are Run. We run similarity searches with
our in-house system TOPOSIM. During a search of a
topobase, TOPOSIM calculates for each database entry the
similarity for each of the eight descriptors. Within TOPOSIM
the user has the option to calculate a final score for
each entry as a user-defined linear combination of the
individual similarities. For instance, the scores for single
descriptors might be

ap score = 1.0 * ap similarity + 0.0 * bp similarity +
           0.0 * hp similarity + ...

bp score = 0.0 * ap similarity + 1.0 * bp similarity +
           0.0 * hp similarity + ...

etc.

For this study we define scores for virtual descriptors called
combination descriptors as the mean of the similarities for
two single descriptors. For instance, the combination
descriptor ap + bp is generated by the combination

ap + bp score = 0.5 * ap similarity +
                0.5 * bp similarity + 0.0 * hp similarity + ...
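
The linear combination can be written as a weighted sum, as in the hedged Python sketch below. The function name and the similarity values are made up for illustration; only the weights follow the text.

```python
def combined_score(similarities, weights):
    """Linear combination of per-descriptor similarities.

    Descriptors missing from `weights` get a coefficient of 0.0.
    """
    return sum(weights.get(name, 0.0) * value
               for name, value in similarities.items())


# Hypothetical similarities of one database entry to the probe.
similarities = {"ap": 0.40, "bp": 0.60, "hp": 0.10}

ap_score = combined_score(similarities, {"ap": 1.0})
ap_bp_score = combined_score(similarities, {"ap": 0.5, "bp": 0.5})
```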
Sorting of Scores. Once all the scores are calculated for
a topobase, they are sorted from high to low score. If there
is more than one fragment per molecule, only the highest
scoring fragment is kept. Ranks are then assigned: the
molecule with the highest score is rank 1, the next highest
rank 2, etc. We use only the ranks of the compounds in
this study, since the distribution of absolute scores varies
from one descriptor to another.
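
A minimal Python sketch of this sorting step is shown below. The function name and input layout are assumptions, and ties are left in whatever order the sort produces because the text does not say how ties are broken.

```python
def rank_molecules(fragment_scores):
    """Return {molecule id: rank} from (molecule id, score) pairs.

    Only the highest scoring fragment of each molecule is kept;
    rank 1 is the highest score.
    """
    best = {}
    for mol_id, score in fragment_scores:
        if mol_id not in best or score > best[mol_id]:
            best[mol_id] = score
    ordered = sorted(best, key=lambda m: best[m], reverse=True)
    return {mol_id: rank for rank, mol_id in enumerate(ordered, start=1)}


# Molecule m1 has two fragments; only its better score (0.9) counts.
ranks = rank_molecules([("m1", 0.2), ("m1", 0.9), ("m2", 0.5), ("m3", 0.7)])
```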

Combination descriptors (see previous section) are one 
way to use two descriptors simultaneously. An alternative 
way, which we call minimum rank sorting, is a postprocess 
that merges the sorted lists of individual descriptors. For 
instance, the virtual minimum rank descriptor mr(ap,bp) is 
generated from the ap and bp lists. We define a new score 
for each compound in the database as its rank in the ap list 
or its rank in the bp list, whichever is smaller. The 
compounds are then sorted by the new score, and new ranks 
are assigned, the smallest score being rank 1, the next rank 
2, etc. This is analogous to the common situation where 
the investigator does separate searches with different descrip- 
tors and submits the union of the top-scoring compounds 
from each. 
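
Minimum rank sorting can be sketched as follows. The function name is hypothetical, and breaking ties by molecule identifier is an assumption made only to keep the example deterministic; the text does not specify a tie rule.

```python
def minimum_rank(rank_a, rank_b):
    """Merge two rank lists into a minimum rank list.

    rank_a, rank_b: {molecule id: rank} for the same set of molecules.
    """
    # New score: the better (smaller) of the two ranks.
    new_score = {m: min(rank_a[m], rank_b[m]) for m in rank_a}
    # Smallest score first; ties broken by identifier (assumption).
    ordered = sorted(new_score, key=lambda m: (new_score[m], m))
    return {m: rank for rank, m in enumerate(ordered, start=1)}


ap_ranks = {"A": 1, "B": 2, "C": 3, "D": 4}
bp_ranks = {"D": 1, "C": 2, "B": 3, "A": 4}
mr_ranks = minimum_rank(ap_ranks, bp_ranks)
```

In the toy lists, the top compound of each individual list ends up at the front of the merged list, which is the "union of the top-scoring compounds" behavior described above.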

Measures of Merit for Similarity Searches. We propose
two measures to determine whether one set of descriptors is
better than another based on how well similarity to a
particular probe correlates with a similar activity. The
measures depend on a simulated screening experiment.
Imagine a database with N compounds that contains Nactive
actives. For a given set of descriptors and a given probe,
calculate the rank for all molecules as described above. Next,
"assay" the compounds in order of ascending rank. We can
graph the total number of actives found versus the total
number of compounds tested to see how rapidly actives are
found. There are two limiting cases. If similarity to the
probe were a perfect predictor of activity, all actives would
be at the front of the list, and the curve would start out with
a slope of 1 and then break to a horizontal line once Nactive
compounds were tested. If similarity were a very poor
predictor, actives would be randomly distributed throughout
the list and accumulate according to their frequency in the
database. The curve would have a nearly constant slope of
Nactive/N. Actual curves, as we will see, somewhat
resemble hyperbolic curves; they start with a steep initial rise,
then level off. Our measures are as follows.
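
The accumulation curve and its two limiting cases can be sketched in Python as below. The function names and the ten-compound toy list are illustrative assumptions, not part of the original study.

```python
from itertools import accumulate


def accumulation_curve(active_flags):
    """Actives found after testing 1, 2, ... compounds in rank order."""
    return list(accumulate(int(flag) for flag in active_flags))


def ideal_curve(n_compounds, n_active):
    """Perfect predictor: slope 1, then flat once all actives are found."""
    return [min(n, n_active) for n in range(1, n_compounds + 1)]


def random_curve(n_compounds, n_active):
    """Very poor predictor: constant slope Nactive/N."""
    return [n * n_active / n_compounds for n in range(1, n_compounds + 1)]


# Toy ranked list: 10 compounds, 3 actives (1 = active).
flags = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
actual = accumulation_curve(flags)
ideal = ideal_curve(len(flags), sum(flags))
rand = random_curve(len(flags), sum(flags))
```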

(1) How many compounds must be tested until half the
actives are found. We call this number A50, and it is
analogous to the IC50 of binding assays or the Km of enzyme
assays. The smaller A50, the better the similarity method
in a global sense. This measure is more relevant to assays
where a very large number of compounds can be tested. A50
can be alternatively expressed as a global enhancement, the
ratio of the A50 expected for the random case (N/2) over
the actual A50. For instance, for A50 = 3000 in a database
of 30 000 compounds, the global enhancement would be
30 000/(2 x 3000) = 5.0.

(2) How many actives are found after testing an arbitrary
small fraction of the total database. For instance, the number
of actives at 300 compounds tested could be called A@300.
The larger A@300, the better. (This is analogous to the
"initial slope" in an enzyme assay.) In some assays, where
one never tests more than a small percentage of a large database,
this measure could be more useful than A50. A@300 can
be expressed as an initial enhancement: how many more
actives there are than expected by chance. For instance, for
A@300 = 150 and a total of 450 actives in the
30 000 compound database, the expected number of actives
is (300/30 000) x 450 = 4.5, and the enhancement is
150/4.5 = 33.3.

Figure 2. Chemical structures of probes used in this study. The
binding property type for each atom is shown where it is not "6".
The Derwent Standard Drug File names are used to label the
molecules. The corresponding generic chemical names are:
MORPHINE, morphine; CYCLIRAMI, cycliramine; DIAZEPAM,
diazepam; APOMORPHI, apomorphine; CAPTOPRIL, captopril;
DIETHYLST, diethylstilbestrol; FENOTEROL, fenoterol; RS-86,
RS-86; SEROTONIN, serotonin; and GABOXADOL, gaboxadol.
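
The two measures of merit can be computed from a ranked list of active/inactive flags, as in the Python sketch below. The function names are hypothetical, and rounding "half the actives" up for an odd number of actives is an assumption; the worked numbers (5.0 and 33.3) are the ones given in the text.

```python
import math


def a50(active_flags):
    """Number of compounds tested (in rank order) until half the
    actives are found. Half is rounded up (assumption)."""
    target = math.ceil(sum(active_flags) / 2)
    found = 0
    for tested, is_active in enumerate(active_flags, start=1):
        found += int(is_active)
        if target and found >= target:
            return tested
    return None  # no actives in the list


def global_enhancement(a50_value, n_compounds):
    """Random-case A50 (N/2) divided by the actual A50."""
    return (n_compounds / 2) / a50_value


def actives_at(active_flags, cutoff):
    """A@cutoff: actives among the first `cutoff` compounds tested."""
    return sum(int(f) for f in active_flags[:cutoff])


def initial_enhancement(found, cutoff, n_compounds, n_active):
    """Actives found divided by the number expected by chance."""
    expected = cutoff / n_compounds * n_active
    return found / expected


# Worked examples from the text.
g = global_enhancement(3000, 30000)
e = initial_enhancement(150, 300, 30000, 450)

# Toy ranked list: 10 compounds, 4 actives.
flags = [False, True, False, True, False, False, True, False, False, True]
toy_a50 = a50(flags)
toy_a5 = actives_at(flags, 5)
```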

Database Used in this Study. In order to measure the
merit of the descriptors we need to have a database of
molecules for which we know the biological activities. For
this purpose, we use the Derwent Standard Drug File (SDF;
ref 10), which is a licensed database of druglike molecules
compiled from the patent literature. There are ~43 000
connection tables in the MACCS-compatible Version 6.0. Most
structures have one or more key words in the "therapeutic
category" field. We will assume that a molecule is active
in the dopaminergic therapeutic area, for instance, if it
contains the key word "DOPAMINERGICS" in this field.
Since not every compound has been tested in every area,
one cannot assume the converse, that a compound without
this key word is inactive. Thus for any given key word there
are probably some "false inactives". There are ~450 distinct
key words in the SDF. Some key words, for example,
"DOPAMINERGIC", involve only one mechanism or receptor.
Others, for example "HYPOTENSIVES", involve many
mechanisms. There has been no effort on the part of the
compilers of this version of SDF to maintain a list of
synonymous key words or to subdivide a single therapeutic area by
mechanism, so we have had to use our judgment in deciding
what key words are relevant for a particular activity.

Table 1. Activities from SDF Used in This Study

activity             keywords                        comments                                          no. actives
narcotic             [illegible]                     opiate agonists and antagonists                   478
antihistamine        [illegible]                     histamine-H1 antagonists                          408
tranquilizer         [partly illegible; includes     mostly benzodiazepine agonists                    399
                     "BENZODIAZEP..."]
dopaminergic         [illegible]                     dopamine agonists                                 268
ace-inhibitor        "ANGIOTENSIN ANTAGONISTS"       mostly ACE inhibitors and analogs of angiotensin  218
estrogen             "ESTROGENS"                     estrogen agonists                                 215
sympathomimetic      "SYMPATHOMIMETICS-BETA"         beta-adrenergic agonists                          188
parasympathomimetic  "PARASYMPATHOMIMETICS"          muscarinic and nicotinic acetylcholine agonists   139
serotoninergic       "SEROTONINERGICS"               serotonin agonists                                70
gabaminergic         "GABAMINERGICS"                 GABA agonists                                     58

We generated the SDF topobase from the MACCS 
database. The topobase contains 37 005 fragments from 
35 635 molecules. 

Choice of Example Probes for Similarity Searches.
Although it is possible to define a probe as a composite of
two or more molecules, we confined ourselves to single
molecule probes. Chemical structures of the probes used in
this study (named by the SDF external registry number) with
the corresponding activity are shown in Figure 2. Table 1
shows how the activities were constructed from key words
in SDF. The probes were arbitrarily selected under two
constraints: (1) The majority of the actives in the therapeutic
area of the probe should work by the same mechanism as
the probe. (Given the limitations of the database, there is
no way to ensure that all actives work by the same
mechanism; thus some actives are "false actives" in this
respect.) (2) There should be >50 actives to ensure
reasonable statistics.

It happens that most of the active compounds in therapeutic 
areas that meet the constraints have cationic centers. We 
feel these probes are fairly representative of small druglike 
molecules. 

RESULTS 

Fuzziness of Descriptors. We calculated the median
value for the descriptor ratio R over all the fragments in the
SDF using each of the descriptors. We also calculated the
median pairwise similarity for 10 000 randomly selected
pairs of compounds. The results are presented as two series,
pairs and torsions:

                        pairs                       torsions
                        ap     bp     hp     cp     tt     bt     ht     ct
median R                2.17   4.68   5.43   6.20   1.68   3.75   4.26   4.56
median pairwise sim.    0.15   0.36   0.39   0.43   0.04   0.26   0.31   0.35

The order of fuzziness is the same for both series and for
both measures: charge > hydrophobic > binding property
>> original. Thus, we are able to demonstrate that the
physicochemical property descriptors are indeed fuzzy compared
to the original descriptors. Pairs are always more fuzzy
than the corresponding torsions (e.g., ap > tt, bp > bt, etc.).

Measures of Merit for Similarity Searches. Figure 3
shows as an example the graph of the accumulation of actives
versus rank for the CYCLIRAMI probe. Table 2 lists the measures
of merit for all probes. Since the list of actives for each
probe inevitably contains false inactives and false actives,
the global and initial enhancements reported in Table 2 are likely
underestimated relative to an ideal list that does not contain
such "noise". However, in this paper we are not interested
in absolute enhancements but in comparisons of those
enhancements among descriptors. The comparisons should
be valid because, for any probe, every set of descriptors sees
the same level of noise.

Figure 3. Curves for the accumulation of actives versus rank
(compounds tested) for the CYCLIRAMI/antihistamine example. Two
limiting cases are also shown: "ideal", where all the actives would
be at the front of the list, and "random", where all the actives
would be evenly distributed throughout the list. a. The curve over
the entire database. b. The curve for the first 300 molecules tested.

We can get an idea of the overall utility of the descriptors, 
at least for this set of probes and this database, by taking 
the mean global enhancement and initial enhancement over 
Table 2. Measures of Merit for Various Descriptors
(glob. = global enhancement; init. = initial enhancement)

descriptor    A50  glob. A@300  init.   descriptor    A50  glob. A@300  init.

MORPHINE/narcotic
ap           2835    6.3   140   34.8   ap+cp        3502    5.1   138   34.3
bp           4750    3.8   115   28.6   tt+bt         953   18.7   183   45.5
hp           7490    2.4   101   25.1   tt+ct        6604    2.7   186   46.3
cp           5193    3.4   104   25.9   ap+tt        1641   10.9   177   44.0
tt           3108    5.7   190   47.3   mr(ap,bp)    2834    6.3   143   35.5
bt           1490   12.0   121   30.0   mr(ap,cp)    3320    5.4   136   33.8
ht           2053    8.7   116   28.8   mr(tt,bt)    1228   14.5   152   37.8
ct           8732    2.0   112   27.8   mr(tt,ct)    3915    4.6   151   37.6
ap+bp        2862    6.2   139   34.6   mr(ap,tt)    1144   15.5   175   43.5

CYCLIRAMI/antihistamine
ap           1438   12.4    68   19.8   ap+cp        1618   11.0    70   20.3
bp           3051    5.8    76   22.1   tt+bt        6104    2.9    48   14.0
hp           3699    4.8    75   21.8   tt+ct        5889    3.0    38   11.1
cp           7110    2.5    47   13.7   ap+tt        3008    5.9    70   20.3
tt           6109    2.9    44   12.8   mr(ap,bp)    1767   10.1    87   25.3
bt           8864    2.0    32    9.3   mr(ap,cp)    1845    9.7    71   20.6
ht           4974    3.6    42   12.2   mr(tt,bt)    7808    2.3    48   14.0
ct           8901    2.0    27    7.9   mr(tt,ct)    6663    2.7    51   14.8
ap+bp        1304   13.7    98   28.5   mr(ap,tt)    2157    8.3    54   15.7

DIAZEPAM/tranquilizer
ap           2935    6.1   102   30.4   ap+cp        4159    4.3    92   27.3
bp           4443    4.0    60   17.9   tt+bt        3314    5.4   114   33.9
hp           5172    3.4    44   13.0   tt+ct        4085    4.4   106   31.5
cp           7483    2.4    52   15.5   ap+tt        6772    2.6   112   33.3
tt           3308    5.4   113   33.6   mr(ap,bp)    3062    5.8    84   25.0
bt           5120    3.5    94   28.0   mr(ap,cp)    3983    4.5    79   23.5
ht           7658    2.3    63   18.8   mr(tt,bt)    3844    4.6   110   32.7
ct           9406    1.9    47   14.0   mr(tt,ct)    5341    3.3    93   27.7
ap+bp        3316    5.4    94   28.0   mr(ap,tt)    3159    5.6   113   33.6

APOMORPHI/dopaminergic
ap           3067    5.8    48   21.2   ap+cp        1448   12.3    57   25.2
bp           2982    6.0    56   24.8   tt+bt        1370   13.0    62   27.4
hp           6311    2.8    31   13.7   tt+ct        1069   16.7    54   23.9
cp           1800    9.9    51   22.6   ap+tt        2338    7.6    39   17.3
tt           3470    5.1    28   12.4   mr(ap,bp)    2924    6.1    57   25.3
bt           2791    6.4    52   23.0   mr(ap,cp)    2443    7.3    54   24.0
ht           6283    2.8    41   18.1   mr(tt,bt)    2630    6.8    56   24.9
ct           2074    8.6    45   19.9   mr(tt,ct)    2287    7.8    43   19.1
ap+bp        2130    8.4    60   26.5   mr(ap,tt)    2347    7.6    48   21.3

CAPTOPRIL/ace-inhibitor
ap           5798    3.1    42   22.9   ap+cp       15063    1.1    29   15.8
bp          18705    1.0    18    9.8   tt+bt        1416   12.6    82   44.7
hp          22322    0.8    20   10.9   tt+ct        2596    6.9    54   29.4
cp          22101    0.8    17    9.3   ap+tt        1446   12.3    75   40.8
tt            541   32.9    79   43.0   mr(ap,bp)    7703    2.3    33   18.0
bt           4321    4.1    40   21.8   mr(ap,cp)    8775    2.0    33   18.0
ht           2000    8.9    39   21.2   mr(tt,bt)     923   19.3    60   32.7
ct          10943    1.6    27   14.7   mr(tt,ct)     962   18.5    52   28.3
ap+bp       12366    1.4    30   16.3   mr(ap,tt)     898   19.8    64   34.9

DIETHYLST/estrogen
ap          10600    1.7    37   20.4   ap+cp        3131    5.7    46   25.4
bp           1483   12.0    67   37.0   tt+bt        4195    4.2    51   28.2
hp           1456   12.2    49   27.0   tt+ct        2910    6.1    56   30.9
cp           1013   17.6    84   46.4   ap+tt       12460    1.4    45   24.9
tt          15196    1.2    44   24.3   mr(ap,bp)    1319   13.5    73   40.3
bt           2725    6.5    33   18.2   mr(ap,cp)    1259   14.2    78   43.1
ht           2350    7.6    43   23.8   mr(tt,bt)    3855    4.6    49   27.1
ct           1064   16.7    53   29.3   mr(tt,ct)    1543   11.6    55   30.4
ap+bp        2504    7.1    38   21.0   mr(ap,tt)   13365    1.3    50   27.6

FENOTEROL/sympathomimetic
ap            489   36.4    83   52.4   ap+cp        1343   13.3    68   43.0
bp           2435    7.3    56   35.4   tt+bt         289   61.7    96   61.8
hp           6264    2.8    38   24.0   tt+ct         224   79.5   101   63.9
cp           4571    3.9    40   25.3   ap+tt         284   62.7    95   60.1
tt            354   50.3    91   57.5   mr(ap,bp)     552   32.3    73   46.2
bt           1438   12.4    52   32.9   mr(ap,cp)     595   29.9    68   43.0
ht           1396   12.8    46   29.1   mr(tt,bt)     371   48.0    85   53.8
ct            637   28.0    78   49.3   mr(tt,ct)     310   57.5    92   58.2
ap+bp         801   22.2    76   48.0   mr(ap,tt)     334   53.3    88   55.7

RS-86/parasympathomimetic
ap           6490    2.7     9    7.7   ap+cp        2567    6.9    29   24.8
bp           3654    4.9    23   19.7   tt+bt       12185    1.5    14   12.0
hp          10177    1.8     7    6.0   tt+ct       10537    1.7     8    6.8
cp           4411    4.0    24   20.5   ap+tt        7028    2.5     9    7.7
tt          14357    1.2     7    6.0   mr(ap,bp)    4622    3.9    14   12.0
bt          13306    1.3     7    6.0   mr(ap,cp)    3671    4.9    16   13.7
ht          15187    1.2    10    8.5   mr(tt,bt)   12264    1.5    10    8.6
ct          10417    1.7    10    8.5   mr(tt,ct)   11023    1.6     8    6.8
ap+bp        4204    4.2    31   26.4   mr(ap,tt)    6221    2.9     5    4.3

124 J. Chem. Inf. Comput. Sci., Vol. 36, No. 1, 1996          KEARSLEY ET AL.

Table 2 (Continued)
(glob. = global enhancement; init. = initial enhancement;
illeg. = value illegible in the source copy)

descriptor    A50  glob. A@300  init.   descriptor    A50  glob. A@300  init.

SEROTONIN/serotoninergic
ap           2240    8.0    17   28.8   ap+cp        1660   10.7    18   30.5
bp         illeg.    3.4    13   22.0   tt+bt        2438    7.3    27   45.8
hp           9846    1.8     9   15.3   tt+ct        2056    8.7    25   42.4
cp           2494    7.1    18   30.5   ap+tt        2073    8.6    21   35.6
tt           2254    7.9    23   39.0   mr(ap,bp)    3380    5.3    16   27.2
bt         illeg.    6.2    25   42.4   mr(ap,cp)    2932    6.1    18   30.6
ht           9807    1.8    14   23.8   mr(tt,bt)    3807    4.7    25   42.5
ct           2914    6.1    20   33.9   mr(tt,ct)    2875    6.2    22   37.3
ap+bp      illeg.    5.8    15   25.5   mr(ap,tt)    2586    6.9    21   35.6

GABOXADOL/gabaminergic
ap           6991    2.5    13   26.6   ap+cp        2504    7.1    16   32.8
bp           3387    5.3    15   30.7   tt+bt        5336    3.3     6   12.3
hp           4024    4.4    11   22.5   tt+ct        5045    3.5    10   20.5
cp           5254    3.4    13   26.6   ap+tt        8558    2.1     7   14.3
tt          16130    1.1     3    6.1   mr(ap,bp)    2078    8.6    15   30.8
bt           3574    5.0     5   10.2   mr(ap,cp)    2874    6.2    15   30.8
ht           6714    2.7     3    6.1   mr(tt,bt)    6624    2.7     3    6.2
ct           3749    4.8    11   22.5   mr(tt,ct)    7021    2.5    10   20.5
ap+bp        1715   10.4    19   38.9   mr(ap,tt)   10814    1.6     8   16.4



all the probes:

                          ap     bp     hp     cp     tt     bt     ht     ct
mean glob. enhancement    8.5    5.4    3.7    5.5   11.4    5.9    5.2    7.3
mean init. enhancement   26.6   25.0   18.0   23.7   28.3   22.3   19.1   22.8



All descriptors give enhancements >>1, indicating general
usefulness. The original descriptors appear to be better on
the average than the property descriptors. Hydrophobic
descriptors are consistently the worst of the property descriptors.
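
As a quick arithmetic check, the mean global enhancement for ap can be recomputed from the ten per-probe ap values listed in Table 2. The short Python sketch below does this; the variable names are illustrative, and the values are simply transcribed from the table in probe order (MORPHINE through GABOXADOL).

```python
from statistics import mean

# Global enhancement of the ap descriptor for each of the ten probes,
# transcribed from Table 2.
ap_global = [6.3, 12.4, 6.1, 5.8, 3.1, 1.7, 36.4, 2.7, 8.0, 2.5]

ap_mean_global = round(mean(ap_global), 1)
ap_best = max(ap_global)   # the single best probe for ap
```

The mean of 8.5 is dominated by one probe (36.4 for FENOTEROL), which is one reason the averaged numbers should be read with caution.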

These average numbers are somewhat misleading, however.
An inspection of Table 2 shows that the effectiveness
of a given individual descriptor varies greatly with the probe
and the measure of merit. For MORPHINE, ap is the best
pair descriptor, and bt is the best torsion descriptor for global
enhancement; ap and tt are the best for initial enhancement.
For DIAZEPAM and FENOTEROL, ap and tt give the best
results with both measures. For CYCLIRAMI, only ap does
well in global enhancement, and ap, bp, and hp do well in
initial enhancement. For APOMORPHI, property descriptors
cp and ct do best in global enhancement, and bp and bt do
best in initial enhancement. For CAPTOPRIL, tt shows
much better performance than any other descriptor. For
DIETHYLST, cp and ct show the best initial and global
enhancement. For RS-86, only bp and cp show reasonable
global and initial enhancements. For SEROTONIN, ap and
tt show the best global enhancements, while ap, cp, tt, and
bt show the best initial enhancements.

The explanation seems to lie in the structural requirements
for activity. By inspecting the appropriate sets of actives it
is easy to explain in retrospect why one descriptor would
do better than another in some of the extreme cases.
Sometimes specific descriptors are best. Most of the
ace-inhibitors, including CAPTOPRIL, contain proline. The
proline residue is specified best by tt, and so no other
descriptor finds actives as effectively as tt. Most
sympathomimetic actives have a secondary amine cation and a
catechol, as does FENOTEROL; the ap and tt can specify
the valence and aromaticity for these groups, and the property
descriptors cannot. In other cases, fuzzy descriptors are best.
Many of the estrogen actives are steroids. Steroids do not
resemble DIETHYLST if one looks at the valence and
hybridization of the atoms, as do the ap and tt, but do
resemble that probe if one looks only at the polarity of the
atoms, as do all the property descriptors. Many of the
parasympathomimetic actives have quaternary amines as
cations and aromatic nitrogens or ether oxygens as hydrogen
bond acceptors. These are recognized as similar to the
tertiary amine and carbonyl oxygens in RS-86 by the property
atom pairs but not by ap.

Combination Descriptors and Minimum Rank Descriptors.
The idea behind the combination descriptors and
minimum rank descriptors is that, since one cannot predict
a priori how well one descriptor will do for a particular probe,
using two or more descriptors simultaneously might increase
the chance that reasonably good results could be obtained.
Since there are 28 possible two-descriptor combinations, we
could not look at all of them, so we selected five that we
thought would be representative. Four of the five combine
an original descriptor (ap or tt) with a corresponding property
descriptor (bp or cp), so that a middle ground might be
reached between specificity and fuzziness. The fifth descriptor
combines the original ap and tt. The measures of merit
for the selected combination descriptors and the equivalent
minimum rank descriptors are also listed in Table 2.

Naively, one might expect the enhancement of a combination
descriptor to be roughly halfway between the enhancements
of the component descriptors, but about half the time
a combination descriptor will do better than both (e.g., for
CYCLIRAMI the initial enhancement for ap + bp is 28.5
versus 19.8 and 22.1 for ap and bp, respectively), or at least
not much worse than the better component descriptor (e.g.,
for MORPHINE initial enhancement for ap + bp is 35.6
versus 35.9 and 29.5 for ap and bp). There are some
examples where the combination descriptor does give an
enhancement halfway between those of the individual
descriptors. These are mostly from those probes where one
of the component descriptors is very poor relative to the other
(e.g., for CAPTOPRIL the global enhancement for tt + bt is
12.6 versus 32.9 and 4.1 for tt and bt). A similar situation
exists for the minimum rank descriptors.

Table 3 lists the mean values over all probes for the 
combination and minimum rank descriptors compared to the 
component descriptors. Both the combination descriptors 
and minimum rank descriptors on the average do as well or 



Physicochemical Topological Descriptors 

Table 3. Measures of Merit for Pairs of Descriptors Averaged Over All Probes 



./. Chan. inf. Comput. Sci., Vol. 36, No. J, 1996 125 



pair of descriptors 



mean global enhancement 



mean initial enhancement 





op, bp 


ap,cp 




tt.ct 


apjt 


combination 
min rank 
component 
combination 
min rank 
component 


8.5" 

9A>> 

8.5,5.4'' 
29.4 
28.6 

26.5,25.0 


7.8 
9.0 

8.5,5.5 
27.9 
28.1 

26.5,23.6 


13.1 
10.9 
1 1.4,5.9 
32.6 
28.0 

28.2,22.2 


13.3 
1 1.6 
11.4,7.3 
30.7 
28.1 

28.2,22.8 


1 1.7 
12.3 

8.5, 1 1.4 
29.8 

28.9 

26.5,28.2 



' For the combination descriptor ap + bp. * For the minimum rank descriptor mr(ap,bp) 



Table 4. Correlation between Ranks of Active Com pounds for Pairs 
hp 



' For the individual descriptors ap and bp. 



ap 



hp 



of Descriptors 



cp 



it 



hi 



fit 



ap 



ap 

bp 

hp 

cp 

tt 

bt 

ht 

ct 



ap 

bp 

hp 

cp 

tt 

bt 

ht 

ct 



ap 

bp 

hp 

cp 

tt 

bt 

ht 

ct 



MORPHTNE/narcotic 
1-00 0.94 0.92 0.93 0 67 
1.00 0.96 0.97 0.56 
1.00 0.96 0.57 
1-00 0.55 
1.00 



bp hp 



ht 



0.88 0.46 

0.82 0.35 

0.74 0.40 

0.79 0.34 

0.68 0.61 

1.00 0.60 
1.00 



0.78 
0.71 
0.68 
0.72 
0.80 
0.82 
0.59 
1.00 



CYCLIRAMI/antihistamine 
1.00 0.76 0.72 0.66 0.78 0 57 
1.00 0.89 0.75 0.60 0 80 
1.00 0.63 0.52 0.7 
1.00 0.45 0.44 
1.00 0.61 
1.00 



0.48 0.55 

0.61 0.62 

0.83 0.59 

0.26 0.69 

0.40 0.4! 

0.68 0.55 

1.00 0.49 
1.00 



DIAZEPAM/tranquilizer 
1.00 0.66 0.64 0.60 
100 0.62 0.83 
1.00 0.58 
1.00 



0.90 0.59 
0.68 0.84 
0.58 0.44 
0.59 0.63 
1 .00 0.69 
1.00 



0.52 
0.39 
0.83 
0.36 
0.51 
0.37 
1.00 



0.67 
0.69 
0.51 
0.71 
0.72 
0.72 
0.47 
1.00 



ap 

bp 

hp 

cp 

tt 

bt 

ht 

ct 



ap 

bp 

hp 

cp 

tt 

bt 

ht 



ap 

bp 

hp 

cp 

tt 

bt 

hi 



1.00 



0.33 
1.00 



W tt bt 

DIETHYLST/esirogen ' 

0.37 0.30 0.85 0.47 0 40 

0.85 0.88 -0.01 0.52 0 27 

100 0.82 0.05 0.51 0 52 

1.00 -0.02 0.43 0.24 

1.00 0.35 0.31 

1 .00 0.70 
1.00 



1.00 



FENOTEROL/sympathomimetic 
0.65 0.59 0.48 0.26 0.50 
1.00 0.82 0.81 0.22 0 44 
100 0.84 0.39 0.21 

100 0.32 0.16 0.58 

1.00 0.21 0.39 

1 .00 0.39 
1.00 



0.59 
0.63 
0.80 



0.24 
0.39 
0.49 
0.44 
0.10 
0.76 
0.65 
1.00 



0.52 
0.68 
0.68 
0.78 
0.42 
0.33 
0.66 
1.00 



1.00 



0.35 
1 .00 



0.39 
0.63 
1.00 



ap 

bp 

hp 

cp 

tt 

bt 

ht 



ap 

bp 

hp 

cp 

tt 

bt 

ht 



1.00 



APOMORPHI/dopaminergic 
0.85 0.80 0.82 0.51 0 43 
1.00 0.91 0.90 0.27 0.50 
100 0.83 0.16 0.40 
1. 00 0.21 0.38 
1 .00 0.40 
1.00 



irasympathomime 


ic 






0.38 


0.54 


0.61 


0.48 


0.52 


0.84 


0.19 


0.45 


0.32 


0.38 


0.62 


0.30 


0.52 


0.71 


0.34 


1 .00 


0.23 


0.37 


0.38 


0.52 




1.00 


0.54 


0.48 


0.44 






1.00 


0.65 
1. 00 


0.66 
0.57 
1.00 



0.66 0.60 

0.79 0.56 

0.86 0.50 

0.68 0.70 

0.15 0.46 

0.64 0.54 

1.00 0.47 
1 .00 



1.00 



CAPTOPRIL/ace-inhibitor 

0.84 0.81 0.81 0.24 0.74 0 83 

1.00 0.98 0.97 -0.04 0.70 0 72 

1-00 0.98 -0.01 0.60 0.69 

1.00 -0.03 0.62 0.66 

1.00 0.18 0.25 

1.00 0.74 
1.00 



0.61 
0.52 
0.51 
0.56 
0.18 
0.43 
0.57 
1. 00 



ap 

bp 

hp 

cp 

tt 

bt 

ht 

ct 



ap 

bp 

hp 

cp 

tt 

bt 

ht 



1.00 



0.85 
1.00 



SEROTONIN/serotoninergic 
0.74 
0.81 
1 .00 



0.72 


0.74 


0.75 


0.65 


0.72 


0.76 


0.49 


0.69 


0.55 


0.61 


0.67 


0.47 


0.65 


0.78 


0.58 


1 .00 


0.50 


0.64 


0.52 


0.73 




1 .00 


0.75 


0.67 


0.75 






1. 00 


0.65 ■ 
1.00 


0.85 
0.66 
1.00 



1.00



0.67 
0.80 



GABOXADOL/gabaminergic 
0.78 0.77 0.71 0.34 
1.00 0.96 0.96 0.36 

1.00 0.92 0.39 0.77 
1.00 0.29 0.70 
1.00 0.40
1.00 



0.14 
0.22 
0.29 
0.19 
0.15 
0.39 
1.00 



0.22 
0.29 
0.15 
0.43 
-0.21 
0.22 
0.12 
1.00 



slightly better than the better component descriptor, especially 
for the initial enhancement. The combination descriptors 
and the minimum rank descriptors seem roughly equivalent. 

Correlation of Ranks between Descriptors. Besides 
looking at how well different descriptors do at selecting 
active compounds, one can also look at how they rank the 
actives. Table 4 summarizes the pairwise correlation of ranks 
for actives. If different descriptors were merely expressing
the same chemical features in a different "notation", one
would expect to see correlations near 1. Some correlations
in the table are reasonably high, but many are not. Whether 
any two descriptors will give high correlations varies from 
probe to probe with no apparent pattern. Also, for any given 
probe we have not found any relationship of the correlation 
between two descriptors and other types of comparison 
between the descriptors. 
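
The pairwise comparison described above can be sketched in a few lines. This is a minimal illustration, assuming that the "correlation of ranks" is an ordinary Pearson correlation coefficient computed over the ranks that two descriptors assign to the same actives; the function name and the rank values are hypothetical and are not taken from Table 4.

```python
from math import sqrt


def rank_correlation(ranks_a, ranks_b):
    """Pearson correlation between two equal-length lists of ranks.

    Assumption: each position holds the rank of the same active compound
    under descriptor A and under descriptor B.
    """
    if len(ranks_a) != len(ranks_b) or len(ranks_a) < 2:
        raise ValueError("need two rank lists of equal length (>= 2)")
    n = len(ranks_a)
    mean_a = sum(ranks_a) / n
    mean_b = sum(ranks_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(ranks_a, ranks_b))
    var_a = sum((a - mean_a) ** 2 for a in ranks_a)
    var_b = sum((b - mean_b) ** 2 for b in ranks_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / sqrt(var_a * var_b)


# Hypothetical ranks of five actives under two descriptors.
ranks_ap = [3, 10, 45, 120, 800]
ranks_bp = [7, 150, 12, 90, 2000]
r = rank_correlation(ranks_ap, ranks_bp)
```

A value of `r` near 1 would indicate that the two descriptors order the actives in nearly the same way.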

Figure 4. Correlation of rank for the ap and bp descriptors
for CYCLIRAMI (rank bp plotted against rank ap): (a) The scatterplot over the entire database. Each
circle represents a compound in SDF with antihistamine activity.
(b) Closeup of the origin of (a). The rank cutoff at 300 is indicated.

Even for cases where a correlation coefficient in Table 4 
appears high, when one looks at the correlation in a 
scatterplot, it is startling how much scatter there is. Figure 
4 shows this for the ap and bp descriptors for CYCLIRAMI. 
In this typical example there is a reasonably high overall 
correlation of the ranks (r = 0.76), but the antihistamine 
actives do not fall near the diagonal line, as would be 
expected if the descriptors ranked actives the same. The 
fact that most of the actives fall above the diagonal in Figure 
4a is consistent with the fact that the global enhancement 
for ap is better than bp for CYCLIRAMI. There are very 
many active compounds near the axes, as is more easily seen 
in Figure 4b. These are actives that would be considered 
highly similar to the probe by one descriptor but not very 
similar by another. This means that even if the ap is better 
than bp in selecting actives, it is not safe to ignore bp; there 
are still new actives to be found. 
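
The point about actives lying near the axes can be made concrete by splitting the actives according to which descriptor places them inside a rank cutoff. The sketch below assumes a cutoff of 300 (the value marked in Figure 4b) and treats "inside the cutoff" as rank <= cutoff; the function name and all compounds except DIPHENHMI (whose ranks are those shown in Figure 5) are hypothetical.

```python
def partition_actives(ranks_a, ranks_b, cutoff=300):
    """Split actives by whether descriptor A, B, or both rank them within cutoff.

    ranks_a and ranks_b map compound name -> rank under each descriptor.
    Only compounds present in both mappings are considered.
    """
    shared = ranks_a.keys() & ranks_b.keys()
    both = {c for c in shared if ranks_a[c] <= cutoff and ranks_b[c] <= cutoff}
    only_a = {c for c in shared if ranks_a[c] <= cutoff < ranks_b[c]}
    only_b = {c for c in shared if ranks_b[c] <= cutoff < ranks_a[c]}
    return both, only_a, only_b


# DIPHENHMI ranks are from Figure 5; the other entries are made up.
ranks_ap = {"DIPHENHMI": 1681, "active_1": 12, "active_2": 250, "active_3": 5000}
ranks_bp = {"DIPHENHMI": 160, "active_1": 40, "active_2": 4100, "active_3": 7000}

both, only_ap, only_bp = partition_actives(ranks_ap, ranks_bp)
```

Compounds in `only_bp` are the actives that a search using ap alone would miss at this cutoff.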

For any given compound, the difference in ranks between
any two descriptors, for instance the ap and bp, can be
quantified as log(rank ap/rank bp). Figure 5 shows the
structures of some of the actives that have the most positive
or negative values. Since the ap descriptor is very specific
for valence and hybridization, the compounds that have low
ranks on ap tend to be close analogs of CYCLIRAMI. The
compounds with low ranks on bp, a descriptor which ignores
these details in favor of physiochemical equivalence, have
more variations (e.g., substitution of ammonium for tertiary
amine in DIPHENHMI).

DIPHENHMI: rank ap 1681, rank bp 160

Figure 5. Selected antihistamines that have very different ranks
by the ap and bp descriptors using CYCLIRAMI as the probe.
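
As a worked example of the log rank ratio, the sketch below evaluates it for DIPHENHMI using the ranks given in Figure 5 (ap 1681, bp 160). The base of the logarithm is not stated in the text; base 10 is assumed here, and the second compound is hypothetical.

```python
from math import log10


def log_rank_ratio(rank_ap, rank_bp):
    """Return log10(rank_ap / rank_bp); base 10 is an assumption.

    Positive values mean the compound ranks better (lower) on bp,
    negative values mean it ranks better on ap.
    """
    if rank_ap < 1 or rank_bp < 1:
        raise ValueError("ranks must be positive integers")
    return log10(rank_ap / rank_bp)


actives = {
    "DIPHENHMI": (1681, 160),          # ranks from Figure 5
    "hypothetical_analog": (15, 900),  # made-up close analog of the probe
}
ratios = {name: log_rank_ratio(ap, bp) for name, (ap, bp) in actives.items()}

# Sort from most negative (favored by ap) to most positive (favored by bp).
ordered = sorted(ratios, key=ratios.get)
```

Under these assumptions DIPHENHMI has a ratio of roughly +1, meaning its ap rank is about ten times its bp rank.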

The observation that the ranks of actives are very sensitive
to the atom type definition applies to almost any pair
of descriptors for almost all the probes.

DISCUSSION 

Similarity searches have been performed with a large 
variety of descriptor types.1,2 In this paper we concentrate
on variations of the atom pair and topological torsion. The
new fuzzy descriptors on average have lesser enhancements
than the original descriptors, but we have retained all
eight descriptors in our current version of TOPOSIM for 
reasons discussed below. It is not unusual for a Merck 
scientist to run similarity probes using all eight descriptors 
and combinations thereof and then to take a union of the 
highest scoring compounds from each search. 
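
The workflow just described, running several descriptors and pooling their best hits, amounts to a set union over the top of each ranked list. The sketch below is illustrative only: the function name, the number of hits kept per search, and the compound identifiers are hypothetical, and it does not represent the TOPOSIM interface.

```python
def union_of_top_hits(rankings, top_n):
    """Pool the top_n compounds from each descriptor's ranked list.

    rankings maps a descriptor name to a list of compound identifiers
    ordered from most to least similar to the probe.
    """
    selected = set()
    for ranked_ids in rankings.values():
        selected.update(ranked_ids[:top_n])
    return selected


# Hypothetical ranked lists for three of the eight descriptors.
rankings = {
    "ap": ["c1", "c2", "c3", "c4"],
    "bp": ["c2", "c7", "c1", "c9"],
    "tt": ["c5", "c1", "c2", "c8"],
}
pooled = union_of_top_hits(rankings, top_n=2)
```

Because different descriptors favor different actives, the pooled set is generally larger and more diverse than the top of any single list.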

We feel it is important to retain and use a variety of
descriptors as long as it can be shown that a descriptor has
an average enhancement much better than 1. The first reason
is that it is very hard to predict a priori whether a particular
descriptor will do well in selecting active compounds for a
particular probe. In retrospect, perhaps we should have
expected this. Various receptors have varying levels of
permissiveness, and chemical groups on drug molecules that
appear equivalent to one receptor may appear very different
to another, so the degree of fuzziness in the ideal
descriptor varies from receptor to receptor. Of course, the
results will also depend on what compounds are in the
database being searched.

The second reason, and perhaps the most important one
with respect to the pharmaceutical industry, has to do with how
differently any two descriptors rank the same set of active
compounds. In very many cases we found that active 
molecules that would be in the front of one list would be far 
down on another. It is clear that different descriptors are 
not just expressing the same chemical features in a different 
way; they actually capture very different features. This is 
probably why combination descriptors prove so useful. In 
practical similarity searches, where one takes a relatively 
small number of top-scoring compounds, the result is that 
different descriptors will seem to select different subsets of 
actives. This is very desirable because at the beginning of 
a drug discovery project one is using similarity probes to 
generate as diverse a set of actives as possible, and one is
willing to use a descriptor with a lesser enhancement in order 
to obtain this diversity. 

A third reason for retaining a variety of descriptors is that 
the more fuzzy ones can become more useful as a project 
progresses and one gets a more inclusive idea of the 
physiochemical features in active molecules. For instance,
one may find that active compounds should contain anions
and not specifically carboxylates or tetrazoles, or hydrophobes
and not specifically sp3 carbon. With the proper fuzzy
descriptor, one can run a properly general similarity search.